Recap: Architecture of an LSTM in general

From left to right in the diagram, $ f_t $ is the forget gate (which values to forget), $ i_t $ is the input gate (which values to update), $ \tilde{C}_t $ is the vector of candidate values (new content that, scaled by $ i_t $, is added to the cell state), and $ o_t $ is the output gate (which parts of the cell state to output).

The key to LSTMs is the cell state, the horizontal line running through the top of the diagram. It works like a conveyor belt that runs along the entire chain. The LSTM can add information to the cell state or remove information from it, regulated by structures called gates.

Gates are composed of a sigmoid neural net layer and a pointwise multiplication operation. The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through: zero lets nothing through, one lets everything through.
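
The gating idea can be sketched in a few lines of NumPy. This is only an illustration: the arrays `values` and `pre_activation` are made-up numbers, not taken from a trained network.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical numbers, chosen only for illustration
values = np.array([2.0, -1.0, 0.5])             # information that wants to pass
pre_activation = np.array([10.0, -10.0, 0.0])   # what the sigmoid layer computed

gate = sigmoid(pre_activation)   # close to 1, close to 0, exactly 0.5
gated = gate * values            # pointwise multiplication

print(gate)
print(gated)
```

The first component passes almost unchanged, the second is almost fully blocked, and the third is halved.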

An LSTM has three of these gates to protect and control the cell state.

  • Forget gate layer: A sigmoid layer that decides, for each element of the cell state $C_{t-1}$, to what extent it is kept (a number between 0 and 1).

E.g. in a language model: the gender of the previous subject stored in the cell state should be forgotten once a new subject is seen.

The next step is to decide what information we're going to store in the cell state. This has two parts:

  • Input gate layer: Decides which values we'll update (a vector of values between 0 and 1, produced by a sigmoid).

Next a tanh layer creates a vector of new candidate values, $\tilde{C}_t$ (between -1 and 1 through tanh), that could be added to the state based on $i_t$.

E.g. Add the gender of the new subject to the cell state.

In order to update the old cell state, $C_{t-1}$, into the new cell state $C_t$, we multiply the old state by $f_t$, forgetting the things we decided to forget earlier and then add $i_t*\tilde{C}_t$, the new candidate values (scaled by how much we decided to update each state value).

E.g. Actually drop information about old subject's gender and add the new information.
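
Written out as equations, the steps so far look roughly as follows. The weight matrices $W_f, W_i, W_C$, the biases $b_f, b_i, b_C$, the previous output $h_{t-1}$ and the current input $x_t$ are notation assumed here (the usual textbook convention); they are not defined in the text above. $\sigma$ is the sigmoid and $*$ is pointwise multiplication.

$$
\begin{aligned}
f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \\
\tilde{C}_t &= \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \\
C_t &= f_t * C_{t-1} + i_t * \tilde{C}_t
\end{aligned}
$$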

  • Output gate layer: Decide what to output. First run a sigmoid layer to decide what parts of the cell state we're going to output. Then, we put the (updated!) cell state through tanh to push the values to be between -1 and 1 and multiply it by the output of the sigmoid gate.

E.g. It might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that’s what follows next.
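
To make the recap concrete, here is a minimal NumPy sketch of a single LSTM time step. It is an illustration under assumptions, not how Keras implements the layer: the function name `lstm_step`, the stacked weight matrix `W`, the bias `b` and the random initialisation are all hypothetical choices made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step (illustrative sketch).

    W maps the concatenation [h_prev, x_t] to four stacked
    pre-activations (forget, input, candidate, output); b is the bias.
    """
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f_t = sigmoid(z[0 * n:1 * n])        # forget gate
    i_t = sigmoid(z[1 * n:2 * n])        # input gate
    c_tilde = np.tanh(z[2 * n:3 * n])    # candidate values
    o_t = sigmoid(z[3 * n:4 * n])        # output gate
    c_t = f_t * c_prev + i_t * c_tilde   # updated cell state
    h_t = o_t * np.tanh(c_t)             # output: gated, squashed cell state
    return h_t, c_t

# Hypothetical sizes and random weights, for illustration only
rng = np.random.default_rng(0)
n_hidden, n_input = 4, 1
W = rng.normal(scale=0.1, size=(4 * n_hidden, n_hidden + n_input))
b = np.zeros(4 * n_hidden)

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x in [0.0, 0.5, 1.0]:                # run over a tiny input sequence
    h, c = lstm_step(np.array([x]), h, c, W, b)
print(h.shape, c.shape)
```

Because $h_t$ is a sigmoid output multiplied by a tanh output, every component of the output lies strictly between -1 and 1.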

Implementation of univariate time series prediction


In [27]:
import numpy as np
from matplotlib import pyplot as plt

In [35]:
def normalise_windows(window_data):
    normalised_data = []
    for window in window_data:
        normalised_window = [((float(p) / float(window[0])) - 1) for p in window]
        normalised_data.append(normalised_window)
    return normalised_data
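
A small worked example may help to see what this normalisation does: every value in a window is expressed as a relative change with respect to the window's first value. The input numbers are made up, and `denormalise_window` is a hypothetical helper (not part of the notebook) that inverts the transformation.

```python
def normalise_windows(window_data):
    # same logic as in the cell above: n_i = p_i / p_0 - 1
    normalised_data = []
    for window in window_data:
        normalised_window = [((float(p) / float(window[0])) - 1) for p in window]
        normalised_data.append(normalised_window)
    return normalised_data

def denormalise_window(normalised_window, first_value):
    """Hypothetical inverse: p_i = p_0 * (n_i + 1)."""
    return [first_value * (n + 1) for n in normalised_window]

# Made-up example windows
windows = [[100.0, 110.0, 95.0], [50.0, 50.0, 75.0]]
normalised = normalise_windows(windows)
print(normalised)   # roughly [[0.0, 0.1, -0.05], [0.0, 0.0, 0.5]]

restored = denormalise_window(normalised[0], windows[0][0])
print(restored)     # roughly [100.0, 110.0, 95.0]
```

Note that the function divides by the first value of each window, so it cannot be used on windows that start with 0 (such as a sine wave crossing zero).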

In [37]:
# transform and load time series
def load_data(filename, seq_length, normalise_window):
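    # read one value per line; skip blank lines and convert to float
    with open(filename, 'r') as f:
        data = [float(line) for line in f.read().split('\n') if line.strip()]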
    sequence_length = seq_length + 1
    result = []
    # divide raw data into sequences -> goes until index: len(data) - seq_len
    for index in range(len(data) - sequence_length):
        result.append(data[index:index+sequence_length])
    
    # optionally normalise each window relative to its first value
    if normalise_window:
        result = normalise_windows(result)
        
    result = np.array(result)
    # train/test split -> 0.9/0.1
    row = round(0.9 * result.shape[0])
    train = result[:int(row), :]
    # np.random.shuffle(train)
    x_train = train[:, :-1]
    y_train = train[:, -1]
    x_test = result[int(row):, :-1]
    y_test = result[int(row):, -1]
    
    # reshape vectors to fit LSTM input in keras
    # keras expects: (N, W, F) where N is the number of training sequences, W is the sequence length and F is the number of features of each sequence
    x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1], 1)) # f = 1
    x_test = np.reshape(x_test, (x_test.shape[0], x_test.shape[1], 1))

    return [x_train, y_train, x_test, y_test]
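
The windowing and reshaping can be checked on a toy series. This sketch reuses the loop bounds and slicing of `load_data` but works on a made-up array instead of a file, so the shapes are easy to verify by hand.

```python
import numpy as np

data = np.arange(10, dtype=float)   # toy series 0.0 .. 9.0
seq_length = 3
sequence_length = seq_length + 1    # 3 inputs + 1 target per window

# same loop bounds as in load_data
windows = np.array([data[i:i + sequence_length]
                    for i in range(len(data) - sequence_length)])

x = windows[:, :-1]                         # first 3 values of each window
y = windows[:, -1]                          # last value is the target
x = x.reshape(x.shape[0], x.shape[1], 1)    # (N, W, F) with F = 1

print(windows.shape)   # (6, 4)
print(x.shape)         # (6, 3, 1)
print(x[0, :, 0], y[0])
```

With these loop bounds the last window is `[5, 6, 7, 8]`, so the final value of the series (9) is never used as a target.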

In [24]:
import os
import warnings

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' #Hide messy TensorFlow warnings
warnings.filterwarnings("ignore") #Hide messy Numpy warnings

In [29]:
from keras.layers import Dense, Activation, Dropout, LSTM
from keras.models import Sequential
import time

# build LSTM model with network structure [1, 50, 100, 1]
# input: 1 feature per time step (the sequences fed in have length 50)
# 1 LSTM layer with 50 units
# 1 LSTM layer with 100 units
# 1 fully-connected layer of 1 neuron with a linear activation function
def build_model(layers):
    model = Sequential()
    
    model.add(LSTM(layers[1], input_shape=(None, layers[0]), return_sequences=True))
    model.add(Dropout(0.2))
    
    model.add(LSTM(
        layers[2],
        return_sequences=False))
    model.add(Dropout(0.2))
    
    model.add(Dense(
        layers[3]))
    model.add(Activation("linear"))
    
    start = time.time()
    model.compile(loss='mse', optimizer='rmsprop')
    print('> Compilation Time: ', time.time() - start)
    return model
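
To get a feeling for the size of this network, the number of trainable parameters can be computed by hand. The sketch below assumes standard LSTM layers with biases and no peephole connections; the helper names `lstm_param_count` and `dense_param_count` are made up for this example. Dropout and the linear activation add no parameters.

```python
def lstm_param_count(input_dim, units):
    """Four sigmoid/tanh layers, each with weights over [h, x] plus a bias."""
    return 4 * (units * (units + input_dim) + units)

def dense_param_count(input_dim, units):
    """One weight per input-output pair plus one bias per output."""
    return input_dim * units + units

layers = [1, 50, 100, 1]
counts = [
    lstm_param_count(layers[0], layers[1]),   # first LSTM layer
    lstm_param_count(layers[1], layers[2]),   # second LSTM layer
    dense_param_count(layers[2], layers[3]),  # output layer
]
print(counts, sum(counts))   # [10400, 60400, 101] 70901
```

If a model built with `build_model([1, 50, 100, 1])` is available, `model.summary()` should report comparable numbers.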

In [33]:
def predict_point_by_point(model, data):
    # Predict each timestep given the last sequence of true data, in effect only predicting 1 step ahead each time
    predicted = model.predict(data)
    predicted = np.reshape(predicted, (predicted.size,)) # to 1d vector
    return predicted

In [31]:
# train model and predict point
epochs = 1 # one epoch is enough because the sine wave is trivial to learn
seq_len = 50

X_train, y_train, X_test, y_test = load_data('sinwave.csv', seq_len, False)

model = build_model([1, 50, 100, 1])

model.fit(X_train, y_train, batch_size=512, epochs=epochs, validation_split=0.05)

predicted = predict_point_by_point(model, X_test)

plt.plot(y_test)
plt.plot(predicted)
plt.show()


> Compilation Time:  0.026865005493164062
Train on 4233 samples, validate on 223 samples
Epoch 1/1
4233/4233 [==============================] - 8s - loss: 0.1882 - val_loss: 0.0344

In [52]:
# train model and predict stock prices
epochs = 1
seq_len = 50
global_start_time = time.time()
    
X_train, y_train, X_test, y_test = load_data('sp500.csv', seq_len, True)

model = build_model([1, 50, 100, 1])

model.fit(X_train, y_train, batch_size=512, epochs=epochs, validation_split=0.05)


> Compilation Time:  0.026327133178710938
Train on 3523 samples, validate on 186 samples
Epoch 1/1
3523/3523 [==============================] - 7s - loss: 0.0021 - val_loss: 0.0011
Out[52]:
<keras.callbacks.History at 0x12beff240>

In [53]:
def predict_sequences_multiple(model, data, window_size, prediction_len):
    #Predict sequence of 50 steps before shifting prediction run forward by 50 steps
    prediction_seqs = []
    for i in range(int(len(data) / prediction_len)):
        curr_frame = data[i * prediction_len]
        predicted = []
        for j in range(prediction_len):
            predicted.append(model.predict(curr_frame[np.newaxis, :, :])[0, 0])
            curr_frame = curr_frame[1:]
            curr_frame = np.insert(
                curr_frame, [window_size - 1], predicted[-1], axis=0)
        prediction_seqs.append(predicted)
    return prediction_seqs
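
The inner loop of `predict_sequences_multiple` is a rolling forecast: each prediction is appended to the input window while the oldest value is dropped. The sketch below isolates that mechanism. `LastValueModel` is a hypothetical stand-in for a trained Keras model (it simply predicts "last value + 1"), and `roll_forward` is a made-up name; the window is shifted with `np.concatenate` instead of `np.insert`, which has the same effect here.

```python
import numpy as np

class LastValueModel:
    """Hypothetical stand-in for a trained model: predicts last value + 1."""
    def predict(self, batch):
        # batch has shape (1, window_size, 1); the result has shape (1, 1)
        return batch[:, -1, :] + 1.0

def roll_forward(model, frame, steps):
    """Predict `steps` values, feeding each prediction back into the window."""
    predicted = []
    for _ in range(steps):
        next_value = float(model.predict(frame[np.newaxis, :, :])[0, 0])
        predicted.append(next_value)
        # drop the oldest value, append the newest prediction
        frame = np.concatenate([frame[1:], [[next_value]]], axis=0)
    return predicted, frame

start = np.array([[1.0], [2.0], [3.0]])   # window_size = 3, one feature
preds, final_frame = roll_forward(LastValueModel(), start, steps=4)
print(preds)                 # [4.0, 5.0, 6.0, 7.0]
print(final_frame.ravel())   # [5. 6. 7.]
```

Once the number of steps reaches the window size, the window consists only of predictions, so errors can accumulate; restarting from true data every `prediction_len` steps limits that drift.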

In [54]:
def plot_results_multiple(predicted_data, true_data, prediction_len):
    fig = plt.figure(facecolor='white')
    ax = fig.add_subplot(111)
    ax.plot(true_data, label='True Data')
    # Pad the list of predictions to shift it in the graph to its correct start
    for i, data in enumerate(predicted_data):
        padding = [None for p in range(i * prediction_len)]
        ax.plot(padding + data, label='Prediction')
        ax.legend()
    plt.show()

In [55]:
predictions = predict_sequences_multiple(model, X_test, seq_len, 50)
#predicted = lstm.predict_sequence_full(model, X_test, seq_len)
#predicted = lstm.predict_point_by_point(model, X_test)        

print('Training duration (s) : ', time.time() - global_start_time)
plot_results_multiple(predictions, y_test, 50)


Training duration (s) :  27.288140058517456
